Note: This page's design, presentation and content have been created and enhanced using Claude (Anthropic's AI assistant) to improve visual quality and educational experience.
Week 7 • Sub-Lesson 5

📊 Building Your Data Analysis Workflow

A principled framework for integrating AI into your data analysis pipeline — from defining your research question to interpreting results with domain expertise

What We'll Cover

Earlier this week, we explored how AI can assist with data analysis tasks, examined the landscape of AI-assisted visualisation, and practised translating natural language into working code. Now we bring it all together into a practical, principled workflow that you can apply to your own research data.

This lesson is deliberately hands-on. We will walk through a five-stage analysis workflow, provide copy-and-paste prompt templates for common statistical tasks, discuss when AI genuinely helps and when established statistical packages are the better choice, introduce a Claude Code and Jupyter workflow for interactive analysis, and address the critical privacy and ethical considerations that arise when working with research data.

One principle underpins everything here: AI should accelerate your analysis, not substitute for your understanding. If you let an AI tool choose your statistical test, fit a model, and interpret the results without understanding each step yourself, you have not done analysis — you have outsourced it. The workflow below is designed to keep you in control at every stage.

🛠️ The Principled Data Analysis Workflow

Effective AI-assisted data analysis follows a consistent pattern: you lead with your research question and domain knowledge, AI helps with implementation and computation, and you verify and interpret everything. The five stages below form a cycle rather than a strict linear sequence — you will often return to earlier stages as your understanding of the data deepens. But skipping stages, particularly the first and last, is where most errors and misinterpretations originate.

  1. Define the question clearly

    Before writing any code or touching any data, articulate precisely what you are trying to learn. What is your research question? What variables are involved? What would a meaningful answer look like? What are your hypotheses, if any?

    This sounds obvious, but it is the step most frequently shortchanged when AI tools are available. The temptation is to load data into a chatbot and say "analyse this" — but without a clear question, AI will give you an analysis, just not necessarily the right one. It may choose inappropriate tests, explore irrelevant relationships, or present findings that answer a question you never asked. A well-defined question constrains the analysis space and makes it far easier to evaluate whether the results are meaningful.

    Write your question down. Be specific. "Is there a relationship between X and Y?" is better than "Analyse my data." "Does treatment A produce significantly different outcomes than treatment B, controlling for age and baseline severity?" is better still.

  2. Prepare and understand your data

    Before running any analysis, you need to understand what you are working with. How many observations do you have? What are the variable types (continuous, categorical, ordinal)? Are there missing values, and if so, how much and what pattern? Are there outliers? What does the distribution of each key variable look like?

    AI tools are excellent at generating the code for data exploration — summary statistics, distribution plots, missing value reports, correlation matrices. But you must be the one who interprets what those summaries reveal. A correlation matrix generated by AI is only useful if you understand which correlations matter for your research question and which are incidental. A missing value report is only useful if you can judge whether the missingness is random or systematic, and what that means for your planned analysis.

    This is also the stage where you handle data cleaning — recoding variables, handling duplicates, dealing with outliers, and transforming variables as needed. AI can write the cleaning code, but the decisions about how to clean (e.g., remove outliers vs. winsorise vs. keep them) must be yours, guided by your domain knowledge and your research question.
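The "how to clean" decision can be made concrete with a small sketch. The numbers and column name below are hypothetical; the point is that dropping, winsorising, or keeping the same outlier yields different summary statistics, and choosing between them is your call, not the AI's:

```python
import pandas as pd

# Hypothetical example data with one extreme outlier
scores = pd.Series([52, 61, 58, 49, 55, 63, 60, 57, 250], name="outcome_score")

# Option 1: drop observations above the 95th percentile
kept = scores[scores < scores.quantile(0.95)]

# Option 2: winsorise -- clip extreme values to the 5th/95th percentiles
winsorised = scores.clip(lower=scores.quantile(0.05), upper=scores.quantile(0.95))

# Option 3: keep everything as-is
print(f"raw mean:        {scores.mean():.1f}")
print(f"dropped mean:    {kept.mean():.1f}")
print(f"winsorised mean: {winsorised.mean():.1f}")
```

Each option is defensible in some contexts and wrong in others; the defence comes from your domain knowledge, not from the code.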

  3. Analyse with AI assistance

    Now bring in AI — but for specific, bounded tasks. Ask it to write the code for a particular statistical test you have already decided is appropriate. Ask it to implement a regression model with specific variables you have chosen. Ask it to generate a visualisation of a relationship you want to examine. The key word, as with AI-assisted writing, is bounded: you define the analytical task, AI implements it.

    This is where prompt templates (covered in the next section) become invaluable. A well-structured prompt specifies your data structure, your research question, the test or model you want, and the output format you need. It does not say "do whatever you think is best" — it says "implement this specific analysis and explain each step."

    When AI suggests an analytical approach you had not considered, treat it as a hypothesis to evaluate, not a recommendation to follow. Ask: why would this approach be appropriate for my data? What assumptions does it make? Does my data meet those assumptions? If you cannot answer these questions, you need to learn more before proceeding.

  4. Verify everything

    This is the stage that separates rigorous research from AI-generated output. Every result must be checked. Does the code run without errors? Do the results make sense given what you know about the data? Are the statistical assumptions met? Do sanity checks pass — for example, if you know the approximate mean of a variable, does the computed mean match?

    Verification has several layers. Code verification: read the code AI generated and confirm it does what you intended. Watch for common errors — wrong column names, incorrect grouping variables, inappropriate handling of missing values. Statistical verification: check that assumption tests were run and passed (normality, homoscedasticity, independence). Result verification: compare key results against rough manual calculations or known benchmarks. If AI reports a correlation of 0.95 between two variables you would expect to be weakly related, something is wrong.
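These layers can be encoded as executable sanity checks. A minimal sketch, with hypothetical column names and a made-up benchmark value:

```python
import pandas as pd

# Hypothetical cleaned dataset; column names are illustrative
df = pd.DataFrame({
    "age": [34, 41, 29, 56, 48, 62, 37, 45],
    "outcome_score": [72.0, 65.5, 80.1, 58.3, 61.9, 55.0, 77.4, 68.2],
})

# Layer 1: code verification -- the columns we expect actually exist
expected_cols = {"age", "outcome_score"}
assert expected_cols <= set(df.columns), f"missing: {expected_cols - set(df.columns)}"

# Layer 2: range checks -- values fall inside their documented scales
assert df["outcome_score"].between(0, 100).all(), "outcome_score outside 0-100 scale"
assert df["age"].between(18, 100).all(), "implausible age values"

# Layer 3: benchmark check -- computed mean is close to a known approximate value
approx_known_mean = 67  # e.g. from the codebook or a prior report
assert abs(df["outcome_score"].mean() - approx_known_mean) < 5, "mean far from expected"

print("all sanity checks passed")
```

Checks like these cost a few lines and catch the silent errors (wrong column, wrong scale, wrong subset) that AI-generated code is most prone to.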

    A particularly important check: run the same analysis using an established statistical package (R, Stata, SPSS) on a subset of your data. If the results differ from what AI produced, investigate why before trusting either.

  5. Interpret with your expertise

    AI can tell you that a coefficient is statistically significant at p < 0.01. It cannot tell you whether that coefficient is meaningful in your research context. It cannot tell you whether the effect size matters practically. It cannot tell you how your findings relate to the existing literature, what the theoretical implications are, or what the limitations of your particular study design mean for how the results should be qualified.

    Interpretation is where your domain expertise is irreplaceable. A statistically significant finding may be trivial. A non-significant finding may be important (perhaps your study was underpowered, or the null result challenges a prevailing assumption). The relationship between statistical output and research conclusions is mediated by theory, context, and judgement — all of which are yours, not the AI's.

    Be especially cautious about AI-generated interpretations. When you ask an AI tool to "interpret" your results, it will produce plausible-sounding narratives that may not reflect the actual state of your field. It may cite non-existent literature, draw connections to theories it has fabricated, or offer explanations that sound authoritative but are not grounded in real disciplinary knowledge. Use AI to compute results. Interpret them yourself.

The core principle: AI is most valuable in Stages 2, 3, and 4 — writing code for data exploration, implementing statistical analyses, and running verification checks. It is least valuable (and most dangerous) if used to replace Stage 1 (defining the question) or Stage 5 (interpreting results with domain expertise). Protect your research questions. Protect your interpretations. Let AI handle the implementation in between.

💬 Prompt Templates for Data Analysis

The following prompt templates are designed for common data analysis tasks. Each one is structured to keep you in control — you specify the data, the question, and the approach, and AI provides the implementation. Adapt these to your own data and research context. Notice that every prompt asks for explanations alongside code, so you can verify that each step is doing what you expect.

📊 Loading and Exploring Data

The first step in any analysis is understanding what your dataset contains. This prompt generates a comprehensive exploratory data analysis (EDA) that reveals the structure, distributions, and potential issues in your data before you begin modelling.

PROMPT — Data Exploration

I have a dataset in [format: CSV / Excel / Stata .dta / etc.] with [approximate number] observations and the following variables:

[List your key variables with brief descriptions, e.g.:
- age (continuous, years)
- treatment_group (categorical: control, low_dose, high_dose)
- outcome_score (continuous, 0-100 scale)
- gender (categorical: M/F/Other)
- baseline_score (continuous, 0-100 scale)]

Please write Python code to:
1. Load the data and display the first few rows
2. Show data types, missing value counts, and basic summary statistics
3. Plot the distribution of each continuous variable (histograms or KDE plots)
4. Show frequency tables for each categorical variable
5. Create a correlation matrix for continuous variables
6. Flag any potential issues (extreme outliers, high missingness, unexpected data types)

Use pandas, matplotlib, and seaborn. Add comments explaining what each section does and what I should look for in the output.
Why this works: By listing your variables and their types upfront, you prevent AI from guessing (and guessing wrong). The request for comments and explanations ensures you understand what the code does, rather than blindly running it. The "flag potential issues" instruction leverages AI's pattern-matching ability while keeping the decision about how to handle those issues with you.
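For illustration, the opening of a response to such a prompt might resemble the sketch below. A tiny inline stand-in replaces the actual file load so the example is self-contained, and all values are invented:

```python
import pandas as pd

# Stand-in for pd.read_csv("your_file.csv") so the sketch is self-contained
df = pd.DataFrame({
    "age": [34, 41, 29, 56, None, 62],
    "treatment_group": ["control", "low_dose", "high_dose",
                        "control", "low_dose", "control"],
    "outcome_score": [72.0, 65.5, 80.1, 58.3, 61.9, 140.0],  # 140 exceeds the 0-100 scale
})

# Structure, types, missingness, and summary statistics
print(df.dtypes)
print(df.isna().sum())
print(df.describe(include="all"))

# Frequency table for the categorical variable
print(df["treatment_group"].value_counts())

# Flag potential issues: values outside the documented 0-100 scale
out_of_range = df[~df["outcome_score"].between(0, 100)]
print(f"{len(out_of_range)} observation(s) outside the 0-100 outcome scale")
```

Note how the out-of-range value and the missing age surface immediately; deciding what to do about them is Stage 2 work that belongs to you.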

🔬 Statistical Testing

When you have a specific hypothesis to test, this template helps you implement the appropriate statistical test with proper assumption checking. The key is that you decide which test is appropriate — the prompt asks AI to implement it, not to choose it for you.

PROMPT — Statistical Testing

I want to test whether [state your specific hypothesis, e.g., "the mean outcome_score differs significantly between treatment groups"].

My data has:
- Dependent variable: [name and type, e.g., "outcome_score, continuous, range 0-100"]
- Independent variable: [name and type, e.g., "treatment_group, categorical with 3 levels"]
- Sample size: [approximate N per group]
- Potential confounders: [list any, or "none identified"]

I plan to use [name the test you believe is appropriate, e.g., "one-way ANOVA"].

Please write Python code that:
1. Checks the assumptions for this test (normality, equal variances, independence)
2. Reports the results of each assumption check with clear pass/fail interpretation
3. Runs the test if assumptions are met
4. If assumptions are violated, suggests and implements an appropriate non-parametric alternative
5. Reports effect sizes alongside p-values
6. Creates a visualisation of the group comparison

Explain each step so I can verify the logic. Use scipy.stats and pingouin where appropriate.
Why this works: You state the hypothesis and propose a test before AI touches the data. This forces you to think through the analysis design. The assumption-checking steps are critical — AI will run the test regardless of whether assumptions are met unless you explicitly ask it to check. The request for effect sizes alongside p-values reflects modern best practice in statistical reporting. If the test choice was wrong, the assumption checks will reveal it, and you can then make an informed decision about alternatives.
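A sketch of what the assumption-checking and testing steps can look like in practice, using scipy.stats only and simulated data (group means and sizes are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical outcome scores for three treatment groups
groups = {
    "control":   rng.normal(60, 10, 40),
    "low_dose":  rng.normal(64, 10, 40),
    "high_dose": rng.normal(70, 10, 40),
}

# Assumption checks with pass/fail interpretation
for name, values in groups.items():
    w, p = stats.shapiro(values)
    print(f"{name}: Shapiro-Wilk p = {p:.3f} ({'pass' if p > 0.05 else 'FAIL'})")

lev_stat, lev_p = stats.levene(*groups.values())
print(f"Levene's test p = {lev_p:.3f} ({'pass' if lev_p > 0.05 else 'FAIL'})")

# Run one-way ANOVA if variances are homogeneous, else Kruskal-Wallis
if lev_p > 0.05:
    test_stat, p_val = stats.f_oneway(*groups.values())
    print(f"One-way ANOVA: F = {test_stat:.2f}, p = {p_val:.4f}")
else:
    test_stat, p_val = stats.kruskal(*groups.values())
    print(f"Kruskal-Wallis: H = {test_stat:.2f}, p = {p_val:.4f}")

# Effect size: eta-squared = SS_between / SS_total
all_vals = np.concatenate(list(groups.values()))
grand_mean = all_vals.mean()
ss_between = sum(len(v) * (v.mean() - grand_mean) ** 2 for v in groups.values())
ss_total = ((all_vals - grand_mean) ** 2).sum()
print(f"eta-squared = {ss_between / ss_total:.3f}")
```

The branch between the parametric test and its non-parametric fallback is exactly the kind of logic you should read and confirm before trusting any reported p-value.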

📈 Regression Analysis

Regression models are among the most common analytical tools in research. This template covers the full pipeline from model specification through diagnostics to interpretation.

PROMPT — Regression

I want to model the relationship between [outcome variable] and [list predictor variables].

Context: [brief description of your research question and why these predictors are theoretically relevant]

Data details:
- Outcome: [variable name, type, distribution description]
- Predictors: [list each with type and expected direction of effect]
- Sample size: [N]
- Potential issues: [e.g., "some multicollinearity expected between X1 and X2", "outcome is right-skewed", "clustered data from 10 schools"]

I plan to use [model type, e.g., "multiple linear regression" / "logistic regression" / "mixed-effects model"].

Please write Python code that:
1. Fits the model with the specified variables
2. Displays full model summary with coefficients, standard errors, confidence intervals, and p-values
3. Runs diagnostic checks (residual plots, VIF for multicollinearity, influence measures)
4. Tests key assumptions (linearity, normality of residuals, homoscedasticity)
5. Creates diagnostic visualisation plots
6. If any diagnostics suggest problems, explain what the problem means and suggest remedies

Use statsmodels for the regression. Explain each diagnostic and what I should look for in the output.
Why this works: By providing theoretical context ("why these predictors are relevant"), you anchor the analysis in your domain knowledge rather than letting AI fish for significant predictors. Specifying known issues (multicollinearity, skewness, clustering) demonstrates that you understand your data and ensures AI addresses these rather than ignoring them. The diagnostic suite catches problems that would otherwise go undetected in a quick AI-generated analysis.
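When it comes time to verify a fitted model (Stage 4 of the workflow), one cheap cross-check is to recompute the coefficients yourself with ordinary least squares. A sketch on simulated data with known true coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated data with known coefficients: y = 2.0 + 1.5*x1 - 0.8*x2 + noise
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(scale=0.5, size=n)

# Manual OLS: least-squares fit on the design matrix with an intercept column
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"intercept = {beta[0]:.2f}, b_x1 = {beta[1]:.2f}, b_x2 = {beta[2]:.2f}")

# Sanity check: recovered coefficients should be close to the true values
assert np.allclose(beta, [2.0, 1.5, -0.8], atol=0.15)
```

If coefficients from an AI-generated statsmodels fit disagree with a direct least-squares solve on the same design matrix, something (often the handling of missing values or categorical encoding) differs between the two, and that difference is worth finding.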

📅 Time Series Analysis

Time series data requires specialised handling. This template covers the key steps from decomposition through modelling to forecasting, with appropriate attention to the temporal structure of the data.

PROMPT — Time Series

I have time series data for [describe what is measured, e.g., "monthly GDP growth rates"] spanning [time period, e.g., "January 2010 to December 2024"] with [frequency, e.g., "monthly"] observations.

Research question: [e.g., "Is there a structural break after the policy change in March 2020?" or "Can we forecast the next 12 months?"]

Data details:
- Time variable: [column name and format]
- Value variable(s): [column name(s)]
- Known events or structural changes: [list any, e.g., "COVID lockdown March 2020, policy change June 2022"]
- Suspected seasonality: [yes/no, and if yes, what period]

Please write Python code that:
1. Plots the raw time series with any known events marked
2. Tests for stationarity (ADF test and KPSS test)
3. Decomposes the series into trend, seasonal, and residual components
4. Plots the ACF and PACF to inform model selection
5. Fits an appropriate model based on the diagnostics (explain your choice)
6. Runs residual diagnostics (Ljung-Box test, residual ACF)
7. If forecasting is requested, produces forecasts with confidence intervals

Use statsmodels. Explain each step and what the output tells us about the data's temporal structure.
Why this works: Time series analysis is particularly prone to AI errors because the temporal structure introduces dependencies that standard methods ignore. By specifying known events and suspected seasonality, you provide domain knowledge that AI cannot infer from the data alone. The dual stationarity tests (ADF and KPSS) catch cases where one test alone would give a misleading answer. The requirement to explain model choice ensures transparency rather than a black-box forecast.
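Two of the ideas above (temporal dependence, and differencing to remove trend and seasonality) can be previewed with plain NumPy before any statsmodels machinery. A sketch on a simulated monthly series; all numbers are invented:

```python
import numpy as np

rng = np.random.default_rng(7)
# Simulated monthly series: upward trend + yearly seasonal cycle + noise
t = np.arange(120)  # ten years of monthly data
series = 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(scale=2, size=120)

def lag1_autocorr(x: np.ndarray) -> float:
    """Lag-1 autocorrelation: the simplest hint of temporal dependence."""
    x = x - x.mean()
    return float((x[:-1] * x[1:]).sum() / (x ** 2).sum())

# A trending series is non-stationary: lag-1 autocorrelation is near 1
print(f"raw series lag-1 autocorr:         {lag1_autocorr(series):.2f}")

# First-differencing removes the linear trend; seasonal (lag-12) differencing
# removes the yearly cycle -- the residual should look much closer to noise
d = np.diff(series, n=1)
deseasoned = d[12:] - d[:-12]
print(f"differenced series lag-1 autocorr: {lag1_autocorr(deseasoned):.2f}")
```

This is exactly the intuition the ADF/KPSS tests and ACF plots formalise: strong dependence in the raw series, much weaker dependence once trend and seasonality are removed.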

⚖️ When to Use AI vs. Established Statistical Packages

AI-assisted analysis is powerful, but it is not always the best tool. Understanding when to use AI and when to reach for established statistical software is a mark of analytical maturity. The decision depends on the complexity of the task, the stakes of the analysis, the reproducibility requirements, and your own statistical fluency.

Where AI Excels

⚡ Rapid Prototyping

When you need to quickly explore a dataset, try multiple visualisations, or test whether an analytical approach is feasible before committing to it. AI can generate exploratory code in seconds that would take you minutes or hours to write from scratch.

🔧 Code Translation

When you have working code in one language (e.g., R) and need it in another (e.g., Python), or when you need to convert between frameworks (e.g., from base R to tidyverse). AI handles syntax translation well, though you should always verify the output.

📚 Learning New Methods

When you want to implement a method you have read about but never coded. AI can generate annotated example code that serves as a learning scaffold — much faster than piecing together documentation and Stack Overflow answers.

🔎 Debugging and Error Resolution

When your code throws an error you cannot decipher, or when results are unexpected. Pasting error messages or anomalous output into an AI tool often yields faster diagnosis than searching forums, especially for obscure library-specific issues.

Where Established Packages Are Better

🔐 Publication-Ready Analysis

For analyses that will appear in a paper, use established packages with known, validated implementations. R packages like lme4, survival, or lavaan, or Stata's built-in commands, have been tested by thousands of researchers. You can cite them. Reviewers trust them. AI-generated code has no such pedigree.

📈 Complex Model Fitting

Bayesian models, structural equation models, multilevel models with complex random-effects structures, and other advanced methods are better handled by dedicated packages (e.g., brms in R, Mplus, Stan). AI may generate code that runs but implements the model incorrectly in subtle ways.

📋 Reproducibility Requirements

When your analysis must be exactly reproducible (e.g., for preregistered studies or regulatory submissions), established packages with version-locked dependencies are essential. AI-generated code may use deprecated functions, assume specific package versions, or produce non-deterministic results.
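As a minimum version-locking step, a pinned requirements file captures the exact environment the analysis ran in. The version numbers below are purely illustrative; yours should come from your own environment (e.g., via pip freeze):

```
# requirements.txt -- pin exact versions so the analysis re-runs identically
pandas==2.2.2
numpy==1.26.4
scipy==1.13.1
statsmodels==0.14.2
```

For stricter guarantees (preregistered studies, regulatory work), consider lock files or containerised environments on top of pinned versions.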

✅ Validated Pipelines

Clinical trials, regulatory analyses, and any context with legal or safety implications require validated statistical software. AI-generated analysis code has not been validated and cannot be used where validation is required.

The table below offers one snapshot of how well AI handles common data tasks relative to established packages. AI capability is improving rapidly, and several of these ratings are already drifting upward, so treat them as indicative rather than fixed.

| Task | AI-Assisted | Established Package | Recommendation |
| --- | --- | --- | --- |
| Exploratory data analysis | Excellent | Good | Use AI for speed, verify key findings |
| Data cleaning and wrangling | Good | Good | AI writes code, you make decisions |
| Standard hypothesis tests | Good | Excellent | Use AI to write code, cross-check with package |
| Regression (basic) | Good | Excellent | AI for prototyping, final analysis in package |
| Regression (complex/multilevel) | Moderate | Excellent | Use established packages; AI for setup help only |
| Visualisation | Excellent | Good | AI excels here; iterate quickly |
| Time series (standard) | Good | Excellent | AI for exploration, package for final models |
| Machine learning | Good | Good | AI for code scaffolding, scikit-learn for execution |
| Bayesian analysis | Moderate | Excellent | Use Stan/brms; AI for priors discussion only |
| Clinical trial analysis | Not suitable | Required | Validated software only; no AI in pipeline |
The practical hybrid approach: For most research analyses, the optimal workflow uses AI for the exploratory and prototyping phases (Stages 2 and 3 of the workflow), then moves to established statistical packages for the final, publication-ready analysis. AI writes the first draft of your code. You verify it, refine it, and then reimplement the critical analyses using validated tools. This gives you the speed of AI and the reliability of established packages.

💻 The Claude Code and Jupyter Workflow

One of the most powerful ways to use AI for data analysis is through Claude Code — Anthropic's command-line tool that allows Claude to read, write, and execute code directly in your working environment. When combined with Jupyter notebooks, this creates an interactive analysis workflow where AI can see your data, run code, inspect outputs, and iterate on analyses — all within a documented, reproducible environment.

What is Claude Code? Claude Code is a command-line tool that gives Claude direct access to your filesystem and terminal. Unlike a chatbot conversation where you copy and paste code back and forth, Claude Code can read your data files, write scripts, execute them, see the results, and fix errors — all in a continuous workflow. Think of it as pair-programming with an AI that can also run the code it writes.

Setting Up the Workflow

The following workflow is adapted from Patrick Mineault's guide for scientists using Claude Code (Mineault, 2026). The core idea is to use a Jupyter notebook as your analysis document and Claude Code as your coding assistant, creating a tight feedback loop between you and the AI.

  1. Set up your project structure

    Create a clean project directory with your data, a Jupyter notebook, and a CLAUDE.md file that describes your project context. The CLAUDE.md file is particularly important — it tells Claude about your research question, your data structure, and your analytical conventions, so you do not have to repeat this context in every prompt.

# Example project structure
my-analysis/
  data/
    raw/
      survey_responses.csv
    processed/
      cleaned_survey.csv
  notebooks/
    analysis.ipynb
  CLAUDE.md
  requirements.txt
  2. Write a CLAUDE.md file for your project

    This file provides persistent context that Claude reads at the start of every interaction. It should describe your data, your research questions, and any conventions you want Claude to follow.

# CLAUDE.md example

# Project: Survey Analysis for Health Outcomes Study

## Research Question
Does community health worker (CHW) contact frequency predict patient adherence to antiretroviral therapy (ART), controlling for distance to clinic and socioeconomic status?

## Data Description
- Source: Clinic records from 5 facilities, 2022-2024
- N = 1,247 patients
- Key variables: adherence_score (0-100), chw_visits (count), clinic_distance_km, ses_quintile (1-5), age, gender

## Conventions
- Use Python with pandas, statsmodels, and seaborn
- All plots should use the seaborn "muted" palette
- Report 95% confidence intervals alongside p-values
- Use significance threshold of 0.05 but always report exact p
- Comments in English, variable names in snake_case
  3. Use Claude Code interactively with Jupyter

    Launch Claude Code in your project directory, then ask it to work with your Jupyter notebook. Claude can read existing cells, add new ones, run the notebook, and inspect outputs — creating a conversational analysis experience.

# In your terminal, from the project directory:
claude

# Then interact naturally:
> Load the cleaned survey data and give me an overview of the
  distributions and missing values.

# Claude reads CLAUDE.md, writes code in the notebook,
# executes it, and reports what it finds.

> The adherence_score distribution looks bimodal. Can you investigate
  whether this splits along treatment duration?

# Claude adds new cells, creates visualisations,
# and discusses findings.
  4. Review, verify, and iterate

    After each analysis step, open the Jupyter notebook in your browser to review the code and outputs. Check that the code does what you intended. Verify key numbers against your expectations. Then continue the conversation with Claude Code for the next analytical step.

Why Jupyter + Claude Code? The combination provides three advantages over a standard chatbot conversation: (1) Claude can see and work with your actual data files, catching issues that would be invisible in copy-pasted snippets. (2) The notebook preserves a complete, executable record of your analysis — every step is documented and reproducible. (3) The iterative workflow mirrors how real data analysis works — explore, discover something unexpected, investigate further, refine your approach.

📖 Key Reference: Claude Code for Scientists

Patrick Mineault (2026) provides a detailed guide to using Claude Code for scientific data analysis, including setup instructions, best practices for CLAUDE.md files, and worked examples from neuroscience research.

Mineault, P. (2026). Claude Code for Scientists. NeuroAI.

⚠️ Important limitations of Claude Code for data analysis:
  • Claude Code sends information to Anthropic's servers. Do not use it with data that cannot leave your institution (see Privacy section below).
  • Claude may modify files in your project directory. Always use version control (Git) so you can revert unwanted changes.
  • The interactive workflow is excellent for exploration but may not meet reproducibility standards for final analyses. Export your final analytical pipeline as a standalone script.
  • Claude Code's context window, while large, has limits. Very large datasets should be summarised or sampled before passing to Claude.

🔒 Privacy and Data Considerations

When AI tools enter the data analysis workflow, privacy and data governance become critical concerns. Unlike traditional statistical software that runs entirely on your local machine, AI-assisted analysis may involve sending data to external servers, storing it in third-party systems, or exposing it to models trained on diverse internet data. As a researcher, you have ethical and often legal obligations to protect your participants' data.

⚠️ CRITICAL: Ethics Approval and Sensitive Data

If your research involves human participants, your ethics approval almost certainly includes conditions about how data is stored, processed, and shared. Using an AI chatbot or cloud-based AI tool to analyse participant data may violate your ethics approval unless this was explicitly included in your ethics application and consent forms.

Before using any AI tool with research data, ask yourself:

  • Does your ethics approval cover AI-assisted analysis? If your approved protocol says data will be stored on university servers and analysed using SPSS, sending that data to Claude or ChatGPT is a protocol violation. You may need an amendment.
  • Does your consent form mention AI processing? Participants consented to specific uses of their data. AI processing may not be covered. This is particularly important for qualitative data (interview transcripts, open-ended survey responses) where re-identification risks are higher.
  • Is the data identifiable? Even "anonymised" data may be re-identifiable when combined with other information. AI tools may retain data in logs or training sets. If there is any doubt, do not upload identifiable data to AI services.
  • What are your institutional data governance requirements? UCT, like most universities, has data governance policies that specify where data can be stored and processed. Check with your faculty's research ethics committee or data governance office if unsure.

When in doubt, the safest approach is to use AI tools only with fully anonymised data, synthetic data, or summary statistics — never with raw participant data containing any identifying information.

Practical Strategies for Data Privacy

🛡️ Anonymise Before Analysis

Remove all direct identifiers (names, ID numbers, addresses, dates of birth) before any AI tool touches the data. Replace them with anonymous codes. Do this in a local environment before uploading or sharing with AI.
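A minimal sketch of this step: replace the identifier with a salted hash, then drop all direct identifier columns. Names and values below are invented, and note that whether salted hashing counts as sufficient anonymisation (rather than pseudonymisation) under your governance rules is a question for your ethics office:

```python
import hashlib

import pandas as pd

# Hypothetical raw data with direct identifiers
df = pd.DataFrame({
    "name": ["T. Ndlovu", "A. Peters"],
    "id_number": ["8203155009087", "9101220012083"],
    "date_of_birth": ["1982-03-15", "1991-01-22"],
    "outcome_score": [71.2, 64.8],
})

# The salt must be stored separately, offline, and never shared
SALT = "replace-with-a-secret-value-kept-offline"

def anonymise_id(raw_id: str) -> str:
    """Keyed, irreversible code derived from the original identifier."""
    return hashlib.sha256((SALT + raw_id).encode()).hexdigest()[:12]

df["participant_code"] = df["id_number"].map(anonymise_id)
df = df.drop(columns=["name", "id_number", "date_of_birth"])

print(df)  # only participant_code and outcome_score remain
```

Run this locally, before any file leaves your machine; the mapping from code back to identity should exist nowhere the AI tool can see.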

🎲 Use Synthetic Data for Prototyping

Create a synthetic dataset with the same structure and statistical properties as your real data. Develop and test your analysis pipeline on the synthetic data using AI, then run the finalised pipeline on the real data locally.
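A sketch of generating such a stand-in with NumPy. The variable names and distributions below are assumptions for illustration; match them to the summary statistics of your real data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(123)
n = 500  # match your real sample size

# Synthetic stand-in mirroring the structure (not the records) of the real data
synthetic = pd.DataFrame({
    "age": rng.integers(18, 80, n),
    "treatment_group": rng.choice(["control", "low_dose", "high_dose"], n),
    "outcome_score": np.clip(rng.normal(65, 12, n), 0, 100).round(1),
})

# Develop the full pipeline against this frame with AI assistance,
# then run the finalised pipeline locally on the real data
print(synthetic.describe(include="all"))
```

Because the synthetic frame shares column names, types, and ranges with the real data, any pipeline that runs cleanly on it should run unchanged on the real dataset.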

💻 Prefer Local Execution

Where possible, use AI to generate code rather than analyse data directly. Write prompts that describe your data structure without including actual data. Run the generated code locally on your machine where the data stays.

📜 Check Data Retention Policies

Different AI services have different data retention policies. Some retain conversations and uploaded data for model training. Others offer zero-retention options (e.g., Anthropic's API with zero-retention). Know your tool's policy before using it with any research data.

The South African context: South Africa's Protection of Personal Information Act (POPIA) applies to the processing of personal information, which includes sending data to AI services. If your research data contains personal information as defined by POPIA, using cloud-based AI tools for analysis may constitute cross-border transfer of personal information, which requires specific legal grounds. Consult your institution's POPIA compliance officer if you are working with personal data.

📚 Readings and Resources

The following resources provide deeper guidance on using AI for data analysis in research contexts. Both are practically oriented and written for researchers who are not software engineers.

📖 Core Reading

Mineault, P. (2026). Claude Code for Scientists. NeuroAI.

A detailed walkthrough of setting up Claude Code for scientific research, including project structure, CLAUDE.md conventions, and practical examples from neuroscience. Essential reading for anyone planning to use the Claude Code + Jupyter workflow described above.

📖 Core Reading

Dataquest (2025). Getting Started with Claude Code for Data Scientists.

A step-by-step guide to using Claude Code for data analysis workflows, covering installation, project setup, and practical tips for data scientists transitioning from chatbot-based AI interaction to a code-first approach.

Supplementary resources: If you want to explore further, the documentation for Claude Code provides technical setup instructions, and the Jupyter documentation covers notebook best practices. For privacy considerations specific to African research contexts, consult your institution's research ethics office and the Information Regulator (South Africa) guidance on POPIA compliance.

Summary & Looking Ahead

In this lesson, we assembled a complete workflow for AI-assisted data analysis — from defining your research question through data preparation, AI-assisted analysis, verification, and expert interpretation. The five-stage process is designed to keep you in intellectual and methodological control while making the most of what AI tools can offer.

Key takeaways:

  • Define your question first: AI should implement analyses you have designed, not choose analyses for you. The most common mistake is loading data into an AI tool without a clear question and accepting whatever analysis it produces.
  • Use prompt templates for bounded tasks: Structured prompts for data exploration, statistical testing, regression, and time series keep AI focused on specific tasks you have defined, with explanations you can verify.
  • Know when AI helps and when established packages are better: AI excels at exploratory analysis, visualisation, and code generation. For publication-ready results, complex models, and validated analyses, use established statistical packages.
  • The Claude Code + Jupyter workflow creates a powerful interactive analysis environment, but requires careful attention to data privacy and reproducibility.
  • Protect your participants' data: Ethics approval, consent forms, POPIA, and institutional policies all constrain how you can use AI tools with research data. When in doubt, use anonymised or synthetic data.
  • Interpret with your expertise: AI can compute results. Only you can determine what those results mean in your research context.

Remember that this workflow, like the writing workflow from Week 6, is meant to be adapted. Some researchers will use AI primarily for code generation and run everything locally. Others will embrace the Claude Code interactive workflow for exploratory phases and switch to established packages for final analyses. The right approach is the one that produces rigorous, reproducible, ethically sound research.

Next session: In Sub-Lesson 6, we move to hands-on activities and assessment. You will apply the techniques from this week to a real dataset, build an analysis pipeline using the workflow and prompt templates covered here, and complete the weekly assessment. Come prepared with a dataset from your own research (anonymised if it involves human participants) or use one of the practice datasets we will provide.